Systems and methods for recognizing ambiguity in metadata

ABSTRACT

A method performed at a server system having one or more processors and memory storing one or more programs for execution by the one or more processors includes generating a feature vector that represents a first artist identifier of a plurality of artist identifiers in a first dataset. The feature vector includes a first indication of whether the first artist identifier matches multiple artist entries in one or more second datasets that are distinct from the first dataset. The method includes determining, based at least in part on the first indication, a probability that the first artist identifier is associated with two or more different real-world artists and determining whether the probability satisfies a predetermined probability condition. The method also includes creating, in response to determining that the probability satisfies the predetermined probability condition, a new artist identifier.

RELATED APPLICATION

This application is a continuation of U.S. application Ser. No. 14/977,458, filed Dec. 21, 2015, which is a continuation of U.S. application Ser. No. 13/913,195, filed Jun. 7, 2013, issued as U.S. Pat. No. 9,230,218 on Jan. 5, 2016, which claims priority and benefit of U.S. provisional application No. 61/657,678, filed Jun. 8, 2012, each of which is incorporated by reference herein in its entirety.

TECHNICAL FIELD

The disclosed implementations relate generally to recognizing ambiguity in metadata, and more specifically, to recognizing when an artist identifier is mistakenly associated with multiple artists of the same name.

BACKGROUND

Modern media content providers offer streaming and/or downloadable media from a large content catalog. Indeed, streaming music services may offer access to millions of songs. In order to provide the best service to their customers, content providers offer many different ways for users to search for and identify content to consume. For example, in the context of a streaming music provider, users are able to search for individual tracks or albums, or to search by artist.

For both the content providers and consumers, it is convenient to associate each content item with a unique artist identifier, so that tracks by a particular artist can be quickly and easily located. Typically, it is sufficient to apply the same unique artist identifier to all tracks associated with the same “artist name” (which is a common metadata field for music tracks). Sometimes, though, different artists have the same name, which can lead to tracks from multiple artists being associated with the same artist identifier in the content catalog. This can make it difficult for users to locate certain tracks or certain artists, and can reduce the visibility of real-world artists who should be separately identified. For example, the name “Prince” is associated with both the well-known U.S.-based pop musician and a lesser known Caribbean artist. Unless such ambiguities are recognized and the catalog is corrected to associate different real-world artists with different unique artist identifiers, ambiguous artist identifiers will continue to plague search results, leading to user confusion and a generally poor user experience.

Given the large number of artists in the database, though, it is not feasible to manually review every artist identifier to ensure that it is not ambiguous (i.e., associated with content items from multiple different real-world artists). Accordingly, there is a need to provide ways to detect artist ambiguity in a large content catalog.

SUMMARY

The implementations described herein use statistical methods to determine a likelihood that an artist identifier in a content provider's system is mistakenly associated with content items from multiple real-world artists.

In some implementations, a statistical classifier uses feature vectors to determine whether an artist identifier is likely to be ambiguous. Feature vectors are composed of features that describe or are derived from various aspects of the tracks, albums, or other metadata associated with the artist identifier that are potential indicators of ambiguity. For example, shorter artist names are more likely to be ambiguous than longer names. As another example, an artist identifier associated with albums in two or more different languages can indicate that the artist is likely ambiguous (e.g., because artists are likely to release their albums in the same language). These features, as well as others described herein, are used to populate a feature vector for the artist identifiers in a database. The statistical classifiers then determine, based on the feature vectors, whether individual artist identifiers are likely to be ambiguous.

Various statistical classifiers are used in various implementations, as described herein. For example, a logistic regression classifier is used in some instances, while a naive Bayes classifier is used in others. These classifiers are discussed in greater detail herein.

Once the classifiers have determined that an artist identifier is likely ambiguous, the artist identifier can be flagged or otherwise marked for manual review by a human operator to confirm whether the artist is, in fact, ambiguous. The human operator may also identify which content items belong to which artist, and assign different artist identifiers to each real-world artist. In some cases, automated or semi-automated means may be used instead of manual review to perform these tasks. For example, a content provider will consult a supplemental database that is known to have unambiguous information and correct the ambiguity based on the information in the supplemental database (e.g., by creating new artist identifiers and re-associating content items with the correct identifier).

EXEMPLARY IMPLEMENTATIONS

A method for estimating artist ambiguity in a dataset is performed at an electronic device having one or more processors and memory storing one or more programs for execution by the one or more processors. The method includes applying a statistical classifier to a first dataset including a plurality of media items, wherein each media item is associated with one of a plurality of artist identifiers, each artist identifier identifies a real world artist, and the statistical classifier calculates a respective probability that each respective artist identifier is associated with media items from two or more different real world artists based on a respective feature vector corresponding to the respective artist identifier.

Each respective feature vector includes features selected from the group consisting of: whether the corresponding respective artist identifier matches multiple artist entries in one or more second datasets; whether a respective number of countries of registration of media items associated with the corresponding respective artist identifier exceeds a predetermined country threshold; whether a respective number of characters in the corresponding respective artist identifier exceeds a predetermined character threshold; whether a respective number of record labels associated with the corresponding respective artist identifier exceeds a predetermined label threshold; whether the corresponding respective artist identifier is associated with albums in at least two different languages; and whether a difference between an earliest release date and a latest release date of media items associated with the corresponding respective artist identifier exceeds a predetermined time span threshold.
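The threshold-based features listed above can be sketched as follows. This is only an illustration under stated assumptions: the `ArtistStats` record, its field names, and every threshold value are hypothetical placeholders, since no particular values or data layout are specified here.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical per-artist-identifier summary; field names are illustrative only.
@dataclass
class ArtistStats:
    name: str                       # artist name tied to the artist identifier
    matches_in_second_dataset: int  # same-name artist entries in a second dataset
    countries: set                  # countries of registration of the media items
    labels: set                     # record labels associated with the identifier
    album_languages: set            # languages of the associated albums
    earliest_release: date
    latest_release: date

# Placeholder thresholds (assumed values, not taken from the disclosure).
COUNTRY_THRESHOLD = 3
CHARACTER_THRESHOLD = 10
LABEL_THRESHOLD = 4
TIME_SPAN_THRESHOLD_DAYS = 30 * 365

def binary_feature_vector(s: ArtistStats) -> list:
    """Return the six binary features in the order they are listed in the text."""
    span_days = (s.latest_release - s.earliest_release).days
    return [
        int(s.matches_in_second_dataset > 1),       # matches multiple artist entries
        int(len(s.countries) > COUNTRY_THRESHOLD),  # country count exceeds threshold
        int(len(s.name) > CHARACTER_THRESHOLD),     # character count exceeds threshold
        int(len(s.labels) > LABEL_THRESHOLD),       # label count exceeds threshold
        int(len(s.album_languages) >= 2),           # albums in at least two languages
        int(span_days > TIME_SPAN_THRESHOLD_DAYS),  # release span exceeds threshold
    ]
```

Dropping the `int(... > threshold)` comparisons and returning the raw counts would give the non-binary variant of the same features.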

In some implementations, each respective feature vector includes features selected from the group consisting of: whether the corresponding respective artist identifier matches multiple artist entries in one or more second datasets; a respective number of countries of registration of media items associated with the corresponding respective artist identifier; a respective number of characters in the corresponding respective artist identifier; a respective number of record labels associated with the corresponding respective artist identifier; a respective number of languages of albums associated with the corresponding respective artist identifier; and a respective difference between an earliest release date and a latest release date of media items associated with the corresponding respective artist identifier.

In some implementations, the statistical classifier is a naive Bayes classifier. In some implementations, the feature vector used by the naive Bayes classifier includes the following features: whether the corresponding respective artist identifier matches multiple artist entries in one or more second datasets; whether a respective number of characters in the corresponding respective artist identifier exceeds a predetermined character threshold; and whether the corresponding respective artist identifier is associated with albums in at least two different languages.
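As a rough illustration of how a naive Bayes classifier could turn the three binary features above into a probability of ambiguity, consider the following sketch. The prior and the per-feature conditional probabilities are invented placeholder values (in practice they would be estimated from labeled data), and the function name is hypothetical.

```python
import math

# Hypothetical parameters. Feature order: [artist match,
# character count exceeds threshold, albums in two or more languages].
PRIOR_AMBIGUOUS = 0.1                   # assumed P(ambiguous)
P_GIVEN_AMBIGUOUS = [0.8, 0.2, 0.6]     # assumed P(feature = 1 | ambiguous)
P_GIVEN_UNAMBIGUOUS = [0.05, 0.5, 0.1]  # assumed P(feature = 1 | not ambiguous)

def naive_bayes_probability(features):
    """Posterior P(ambiguous | features) under a Bernoulli naive Bayes model."""
    def log_likelihood(probs):
        # Features are treated as conditionally independent given the class.
        return sum(math.log(p if x else 1.0 - p) for x, p in zip(features, probs))

    log_amb = math.log(PRIOR_AMBIGUOUS) + log_likelihood(P_GIVEN_AMBIGUOUS)
    log_unamb = math.log(1.0 - PRIOR_AMBIGUOUS) + log_likelihood(P_GIVEN_UNAMBIGUOUS)

    # Normalize in a numerically stable way.
    m = max(log_amb, log_unamb)
    amb = math.exp(log_amb - m)
    unamb = math.exp(log_unamb - m)
    return amb / (amb + unamb)
```

With these placeholder parameters, a short name that matches multiple entries and has albums in two languages scores far higher than a long, single-language name with no multiple match.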

In some implementations, the statistical classifier is a logistic regression classifier. In some implementations, the feature vector used by the logistic regression classifier includes the following features: whether the corresponding respective artist identifier matches multiple artist entries in one or more second datasets; whether a respective number of countries of registration of media items associated with the corresponding respective artist identifier exceeds a predetermined country threshold; whether a respective number of characters in the corresponding respective artist identifier exceeds a predetermined character threshold; and whether the corresponding respective artist identifier is associated with albums in at least two different languages.
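A logistic regression classifier over these four binary features might compute its probability as sketched below. The bias and weights are hypothetical values chosen only for illustration; a real classifier would learn them from labeled artist identifiers.

```python
import math

# Hypothetical learned parameters. Feature order: [artist match,
# country count exceeds threshold, character count exceeds threshold,
# albums in two or more languages].
BIAS = -3.0
WEIGHTS = [2.5, 1.0, -0.8, 1.5]

def logistic_probability(features):
    """Map a binary feature vector to a probability via the logistic function."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features))
    return 1.0 / (1.0 + math.exp(-z))
```

The negative weight on the character feature reflects the observation that shorter names are more likely to be ambiguous; its magnitude here is an assumption.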

In some implementations, the method includes providing a report of the first dataset, including the calculated probabilities, to a user of the electronic device.

In some implementations, the method includes determining whether each respective probability satisfies a predetermined probability condition; and setting a flag for each respective artist identifier that satisfies the predetermined probability condition. In some implementations, the predetermined probability condition is whether the respective probability exceeds a predetermined probability threshold. In some implementations, the probability threshold is 0.5, 0.9, or any other appropriate value.
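The flagging step can be illustrated with a short sketch. The function name and the dictionary-based representation of the calculated probabilities are assumptions made for the example.

```python
def flag_ambiguous(probabilities, threshold=0.5):
    """Return {artist_id: flag}, where the flag is set when the probability
    strictly exceeds the threshold (the example condition described above)."""
    return {artist_id: p > threshold for artist_id, p in probabilities.items()}
```

For instance, `flag_ambiguous({"a1": 0.93, "a2": 0.12}, threshold=0.9)` flags only the hypothetical identifier `"a1"`.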

In some implementations, the method includes determining whether the respective probabilities of the respective artist identifiers satisfy a predetermined probability condition; and in response to detecting that a particular probability of a particular artist identifier satisfies the predetermined probability condition: creating a new artist identifier; and associating one or more particular media items with the new artist identifier, wherein the one or more particular media items were previously associated with the particular artist identifier.

In some implementations, the method further includes, prior to creating the new artist identifier, identifying the one or more media items in a second dataset by identifying a first artist entry in the second dataset that is associated with the one or more media items and has a same artist name as the particular artist identifier, and identifying a second artist entry in the second dataset that is not associated with the one or more media items and has the same name as the particular artist identifier.
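One way to picture the re-association described above is the following sketch. It assumes a hypothetical in-memory representation: `catalog` maps media item identifiers to artist identifiers, and each second-dataset entry is a dictionary with a `"name"` and a set of `"items"`. The function name and the use of UUIDs for new artist identifiers are likewise illustrative choices.

```python
import uuid

def split_artist(catalog, artist_id, artist_name, second_dataset):
    """Move media items to a new artist identifier when the second dataset
    lists them under one same-name entry but not under another.
    Returns the new identifier, or None if no split is warranted."""
    same_name = [e for e in second_dataset if e["name"] == artist_name]
    for first in same_name:
        # Media items of this artist identifier that the first entry covers.
        to_move = {item for item, owner in catalog.items()
                   if owner == artist_id and item in first["items"]}
        if not to_move:
            continue
        # Require a second same-name entry not associated with those items.
        has_second = any(e is not first and not (e["items"] & to_move)
                         for e in same_name)
        if has_second:
            new_id = str(uuid.uuid4())  # create the new artist identifier
            for item in to_move:
                catalog[item] = new_id  # re-associate the media items
            return new_id
    return None
```

Which same-name entry keeps the original identifier is arbitrary in this sketch; it simply moves the items of the first qualifying entry.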

In accordance with some implementations, a computer-readable storage medium (e.g., a non-transitory computer readable storage medium) is provided, the computer-readable storage medium storing one or more programs for execution by one or more processors of an electronic device, the one or more programs including instructions for performing any of the methods described above.

In accordance with some implementations, an electronic device is provided that comprises means for performing any of the methods described above.

In accordance with some implementations, an electronic device is provided that comprises a processing unit configured to perform any of the methods described above.

In accordance with some implementations, an electronic device is provided that comprises one or more processors and memory storing one or more programs for execution by the one or more processors, the one or more programs including instructions for performing any of the methods described above.

In accordance with some implementations, an information processing apparatus for use in an electronic device is provided, the information processing apparatus comprising means for performing any of the methods described above.

BRIEF DESCRIPTION OF THE DRAWINGS

The implementations disclosed herein are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings. Like reference numerals refer to corresponding parts throughout the drawings.

FIG. 1 is an illustration of a client-server environment, according to some implementations.

FIG. 2 is a block diagram illustrating a client device, in accordance with some implementations.

FIG. 3 is a block diagram illustrating a content server, in accordance with some implementations.

FIG. 4 is a block diagram illustrating an analytic server, in accordance with some implementations.

FIG. 5A is an illustration of a metadata entry, in accordance with some implementations.

FIG. 5B is an illustration of a feature vector, in accordance with some implementations.

FIGS. 6A-6B are flow diagrams illustrating a method of estimating artist ambiguity in a dataset, in accordance with some implementations.

DETAILED DESCRIPTION

FIG. 1 illustrates the context in which some implementations of the present invention operate. A plurality of users 112 access their client devices 102 to run an application 110, which accesses content items provided by the content provider 116. In some implementations, the application 110 runs within a web browser 224. The application 110 communicates with the content provider 116 over a communication network 108, which may include the Internet, other wide area networks, one or more local networks, metropolitan networks, or combinations of these. The content provider 116 works with the application 110 to provide users with content items, such as audio tracks or videos. The content provider 116 typically has one or more web servers 104, which receive requests from client devices 102, and provide content items, web pages, or other resources in response to those requests. The content provider also includes one or more content servers 106, which select appropriate content items for users. The data used by the content servers 106 is typically stored in a database 118, including content items 324 and associated metadata, as described below with respect to FIG. 3. In some implementations, the database 118 is stored at one or more of the content servers 106. In some implementations, the database is a relational SQL database. In other implementations, the data is stored as files in a file system or other non-relational database management system.

The client device 102 includes an application 110, such as a media player that is capable of receiving and displaying/playing back audio, video, images, and the like. The client device 102 is any device or system that is capable of storing and presenting content items to a user. For example, the client device 102 can be a laptop computer, desktop computer, handheld or tablet computer, mobile phone, digital media player, portable digital assistant, television, etc. Moreover, the client device 102 can be part of, or used in conjunction with, another electronic device, such as a set-top-box, a stereo or home-audio receiver, a speaker dock, a television, a digital photo frame, a projector, a smart refrigerator, a “smart” table, or a media player accessory.

In some implementations, the client device 102, or an application 110 running on the client device 102, requests web pages or other content from the web server 104. The web server 104, in turn, provides the requested content to the client device 102.

The content items 324 stored in the database 118 include audio tracks, images, videos, etc., which are sent to client devices 102 for access by users 112. For example, in implementations where the application 110 is a media player, the application 110 may request media content items, and the content provider 116 sends the requested media content items to the client device 102.

An analytic server 122 performs statistical analyses on the information in the database 118 to identify artist identifiers that are likely to be ambiguous (e.g., associated with media content from multiple real-world artists), as described herein. In some implementations, based on the statistical analyses, the analytic server 122 provides reports identifying those artist identifiers that may be ambiguous, so that they can be reviewed and corrected (manually or automatically, e.g., by the analytic server 122).

A metadata server 124 is associated with a metadata provider 117, which provides curated metadata for media content (e.g., from the metadata database 120). Metadata from the metadata provider 117 can be used by the content provider 116 to help identify and/or correct ambiguous artist identifiers. In some implementations, the content provider 116 uses multiple metadata providers 117 and/or metadata servers 124 to identify and/or correct ambiguous artist identifiers, as discussed below.

FIG. 2 is a block diagram illustrating a client device 102, according to some implementations. The client device 102 typically includes one or more processing units (CPUs, sometimes called processors or cores) 204 for executing programs (e.g., programs stored in memory 214), one or more network or other communications interfaces 212, user interface components 206, memory 214, and one or more communication buses 202 for interconnecting these components. The communication buses 202 may include circuitry (sometimes called a chipset) that interconnects and controls communications between system components. In some implementations, the user interface 206 includes a display 208 and input device(s) 210 (e.g., keyboard, mouse, touchscreen, keypads, etc.). In some implementations, the client device 102 is any device or system that is capable of storing and presenting content items to a user. In some implementations, the client device 102 is a mobile device, including, but not limited to, a mobile telephone, audio player, laptop computer, handheld or tablet computer, digital media player, portable digital assistant, or the like. In some implementations, the client device 102 is a desktop (i.e., stationary) computer. In some implementations, the client device is, or is incorporated into, a set-top-box, a stereo or home-audio receiver, a speaker dock, a television, a digital photo frame, a projector, a smart refrigerator, a “smart” table, or a media player accessory.

Memory 214 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM or other random access solid state memory devices; and typically includes non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices. Memory 214 optionally includes one or more storage devices remotely located from the CPU(s) 204. Memory 214, or alternately the non-volatile memory device(s) within memory 214, comprises a non-transitory computer readable storage medium. In some implementations, memory 214 or the computer readable storage medium of memory 214 stores the following programs, modules, and data structures, or a subset thereof: an operating system 216, which includes procedures for handling various basic system services and for performing hardware dependent tasks; a communications module 218, which connects the client device 102 to other computers (e.g., the web server 104, the content server 106, etc.) via the one or more communication interfaces 212 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on; a user interface module 220, which receives commands from the user via the input device(s) 210 and generates user interface objects in the display device 208; an application 110 (e.g., a media player, a game, etc.), which provides one or more computer-based functions to a user; and a web browser 224, which allows a user to access web pages and other resources over the web. In some implementations, the application 110 runs within the web browser 224.

The application 110 is any program or software that provides one or more computer-based functions to a user. In some implementations, the application is a media player. In some implementations, the application is a computer game. The application 110 may communicate with the web server 104, the content server 106, as well as other computers, servers, and systems.

In some implementations, the programs or modules identified above correspond to sets of instructions for performing a function or method described herein. The sets of instructions can be executed by one or more processors or cores (e.g., the CPUs 204). The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, or modules, and thus various subsets of these programs or modules may be combined or otherwise re-arranged in various implementations. In some implementations, memory 214 stores a subset of the modules and data structures identified above. Furthermore, memory 214 may store additional modules and data structures not described above.

FIG. 3 is a block diagram illustrating a content server 106, according to some implementations. The content server 106 typically includes one or more processing units (CPUs, sometimes called processors or cores) 304 for executing programs (e.g., programs stored in memory 314), one or more network or other communications interfaces 312, an optional user interface 306, memory 314, and one or more communication buses 302 for interconnecting these components. The communication buses 302 may include circuitry (sometimes called a chipset) that interconnects and controls communications between system components. In some implementations, the user interface 306 includes a display 308 and input device(s) 310 (e.g., keyboard, mouse, touchscreen, keypads, etc.).

Memory 314 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM or other random access solid state memory devices; and typically includes non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices. Memory 314 optionally includes one or more storage devices remotely located from the CPU(s) 304. Memory 314, or alternately the non-volatile memory device(s) within memory 314, comprises a non-transitory computer readable storage medium. In some implementations, memory 314 or the computer readable storage medium of memory 314 stores the following programs, modules, and data structures, or a subset thereof: an operating system 316, which includes procedures for handling various basic system services and for performing hardware dependent tasks; a communications module 318, which connects the content server 106 to other computers (e.g., the client device 102, the web server 104, etc.) via the one or more communication interfaces 312 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on; an optional user interface module 320, which receives commands via the input device(s) 310 and generates user interface objects in the display device 308; a content item selection module 322, which selects content items 324 for individual users and/or for Internet radio stations based on one or more criteria; and a database 118, which stores content items 324 and other data used by the content item selection module 322 and other modules running on the content server 106.

Each content item 324 includes the playable content 326 (e.g., the actual audio track or video), as well as metadata about the content item 324. The metadata includes an artist identifier 327 uniquely identifying the real-world artist that produced or is otherwise associated with the content item, the title of the content item 328, the name(s) of the artists or group (e.g., singer, band, actor, movie producer, composer, conductor) 330, and other metadata 332 (e.g., genre, album title, International Standard Recording Code (“ISRC”), etc.). In some implementations, the metadata includes metadata stored in an ID3 container associated with a content item.

In some implementations, content items 324 are audio tracks, videos, images, interactive games, three-dimensional environments, or animations.

The database 118 also includes feature vectors which represent each artist identifier 327 in an n-dimensional vector space. The components of the feature vectors (i.e., the individual features included in the feature vectors) are discussed herein.

The database 118 also includes a list of users 336, who are typically registered users. This allows the content server to track the likes and dislikes of the users, and thus present users with content items 324 that better match a user's likes. In some implementations, the database stores playlists 338 for each user, which are lists of content items 324. A playlist may be either completely constructed by the user or partially constructed by a user and filled in by the content item selection module 322 (e.g., by identifying items similar to or correlated with content items already in the playlist and/or otherwise selected by the user). An individual user may have zero or more playlists. Some implementations store user preferences 340 provided by each user. When provided, user preferences may enable the content item selection module 322 to provide better content item selections. The database also stores item selection criteria 342. In some implementations, the criteria are stored for each individual user. Some implementations enable multiple sets of selection criteria for an individual user (e.g., for a user who likes to listen to both jazz and classical music, but at different times). Some implementations support group selection criteria, which can be used independently or in conjunction with personal item selection criteria.

In some implementations, the programs or modules identified above correspond to sets of instructions for performing a function or method described herein. The sets of instructions can be executed by one or more processors or cores (e.g., the CPUs 304). The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, or modules, and thus various subsets of these programs or modules may be combined or otherwise re-arranged in various implementations. In some implementations, memory 314 stores a subset of the modules and data structures identified above. Furthermore, memory 314 may store additional modules and data structures not described above.

FIG. 4 is a block diagram illustrating an analytic server 122, according to some implementations. The analytic server 122 typically includes one or more processing units (CPUs, sometimes called processors or cores) 404 for executing programs (e.g., programs stored in memory 414), one or more network or other communications interfaces 412, an optional user interface 406, memory 414, and one or more communication buses 402 for interconnecting these components. The communication buses 402 may include circuitry (sometimes called a chipset) that interconnects and controls communications between system components. In some implementations, the user interface 406 includes a display 408 and input device(s) 410 (e.g., keyboard, mouse, touchscreen, keypads, etc.).

Memory 414 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM or other random access solid state memory devices; and typically includes non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices. Memory 414 optionally includes one or more storage devices remotely located from the CPU(s) 404. Memory 414, or alternately the non-volatile memory device(s) within memory 414, comprises a non-transitory computer readable storage medium. In some implementations, memory 414 or the computer readable storage medium of memory 414 stores the following programs, modules, and data structures, or a subset thereof: an operating system 416, which includes procedures for handling various basic system services and for performing hardware dependent tasks; a communications module 418, which connects the analytic server 122 to other computers (e.g., the content server 106, the metadata server 124, etc.) via the one or more communication interfaces 412 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on; an optional user interface module 420, which receives commands via the input device(s) 410 and generates user interface objects in the display device 408; an analysis module 422, which performs statistical analyses on the contents of the database 118 (including, e.g., artist identifiers, content items, metadata, etc.) to identify artist identifiers that are likely to be ambiguous (e.g., associated with media content from multiple real-world artists); and an optional reporting module 424, which produces reports identifying those artist identifiers that may be ambiguous, so that they can be reviewed and corrected (either automatically, e.g., by the analytic server 122, or manually by a human operator).

In some implementations, the programs or modules identified above correspond to sets of instructions for performing a function or method described herein. The sets of instructions can be executed by one or more processors or cores (e.g., the CPUs 404). The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, or modules, and thus various subsets of these programs or modules may be combined or otherwise re-arranged in various implementations. In some implementations, memory 414 stores a subset of the modules and data structures identified above. Furthermore, memory 414 may store additional modules and data structures not described above.

As described above, the statistical classifiers used to identify potentially ambiguous artist identifiers operate on feature vectors. A feature vector is an n-dimensional vector associated with an artist identifier. Feature vectors are populated with data that describes and/or is derived from content items or metadata of content items associated with the artist identifier. In particular, the components of a feature vector are values corresponding to particular features of the artist identifier, where the features provide some indication as to whether an artist identifier is ambiguous.

Feature vector components may be binary (e.g., having a value of 0 or 1), integers, decimals, etc. In some implementations, feature vectors are composed entirely of binary components. In some implementations, feature vectors are composed of a combination of binary and integer components.

An exemplary feature represented in a feature vector is whether an artist name associated with an artist identifier matches multiple artist entries in a supplemental metadata database (referred to herein as an “artist match” feature). For example, if the artist name “Prince” is associated with only one artist identifier in the database 118 of the content provider, but is associated with two (or more) different artist entries in a supplemental metadata database (e.g., metadata database 120), then the artist identifier of the content provider is likely to be ambiguous. In particular, because another source of metadata indicates that there is more than one artist with the same name, that is a reasonable indication that there are, in fact, multiple real-world artists with that name.

In some implementations, multiple instances of the artist match feature are used in a feature vector, where each instance corresponds to a different supplemental metadata database. By consulting multiple supplemental metadata databases, the likelihood of identifying ambiguous artist identifiers increases. For example, different databases may have metadata for different artists (or may have metadata of varying quality or completeness), so including artist match features derived from multiple databases will increase the likelihood that ambiguities will be detected. Moreover, any given metadata database may suffer from the same artist ambiguity problem as the primary database (e.g., the database 118 of the content provider 116). Thus, by including features derived from multiple databases, the likelihood of correctly identifying ambiguous artist identifiers increases.

In some implementations, the artist match feature is binary. For example, a value of 1 indicates that the artist name associated with an artist identifier matches multiple artist entries in a supplemental metadata database, and a value of 0 indicates that it does not.

In some implementations, the artist match feature is represented as an integer, where the integer represents the number of different artists in the supplemental metadata database that match the artist name associated with the artist identifier. This may increase the resolution of the feature vector, because an artist name that matches a greater number of artists in the supplemental database may be more likely to be ambiguous than one that matches fewer artists. For example, if an artist name from the service provider's database matches six artists in a supplemental metadata database, the probability that the artist name is ambiguous is greater than if it matched only two artists in that supplemental metadata database. The higher resolution offered by the integer representation of the artist match feature may result in more accurate identification of potentially ambiguous artist names.
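The binary and integer forms of the artist match feature can be sketched as follows. This is a minimal illustration only: the function name, the (entry identifier, name) layout of the supplemental database, and the case-insensitive name comparison are assumptions made for the example and are not specified above.

```python
def artist_match_features(artist_name, supplemental_entries):
    """Return (binary_feature, match_count) for one supplemental database.

    supplemental_entries is a hypothetical iterable of (entry_id, name)
    pairs; real supplemental databases will have their own schema.
    """
    # Assumed matching rule: names are equal after trimming and case folding.
    wanted = artist_name.strip().casefold()
    matching_ids = {
        entry_id
        for entry_id, name in supplemental_entries
        if name.strip().casefold() == wanted
    }
    count = len(matching_ids)
    # Binary form: 1 when the name matches multiple artist entries.
    return (1 if count > 1 else 0), count


entries = [("a1", "Prince"), ("a2", "prince"), ("a3", "Yes")]
binary, count = artist_match_features("Prince", entries)
```

When several supplemental databases are consulted, the same function could be called once per database, yielding one artist match component per database.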

Another exemplary feature is whether a number of countries of registration of media items associated with an artist identifier satisfies a predetermined condition (i.e., a “country count” feature). Specifically, artist identifiers that are associated with tracks registered in many different countries are more likely to be ambiguous than those with tracks registered in fewer countries. One possible reason for this is that artists may tend to register their recordings in their primary country of residence. Thus, when an artist identifier is associated with multiple different countries of registration, it may be more likely that the artist identifier is mistakenly associated with tracks from multiple artists.

In some implementations, the predetermined condition is whether the number of countries of registration exceeds a predetermined threshold. In some implementations, the country count feature is binary, and the value of the feature is 1 if the threshold is exceeded and 0 if the threshold is not exceeded. In some implementations, the predetermined threshold is 1, such that if a number of countries of registration associated with an artist identifier is two or more, the feature has a value of 1. In some implementations, the predetermined threshold is 2, 3, 4, 5, or any other appropriate number.

In some implementations, the country count feature is represented as an integer, where the integer represents the number of different countries of registration associated with the artist identifier. Accordingly, a greater integer value may indicate a higher likelihood that the artist identifier is ambiguous.
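As a rough sketch of the country count feature in both forms, the following assumes each track is a dictionary with a "country" key holding a country code; that layout and the function name are illustrative assumptions.

```python
def country_count_features(tracks, threshold=1):
    """Return (binary_feature, country_count) for one artist identifier.

    tracks is a hypothetical list of dicts with a "country" key.
    """
    # Count distinct countries of registration, ignoring missing values.
    countries = {t["country"] for t in tracks if t.get("country")}
    count = len(countries)
    # Binary form: 1 only when the count exceeds the threshold.
    return (1 if count > threshold else 0), count


tracks = [{"country": "US"}, {"country": "US"}, {"country": "JM"}]
country_binary, country_count = country_count_features(tracks)
```

With the threshold of 1 described above, two countries of registration are enough to set the binary feature to 1.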

Another exemplary feature is whether a number of characters in an artist name associated with an artist identifier satisfies a predetermined condition (i.e., a “name length” feature). In some cases, this feature indicates artist ambiguity because shorter names are more likely to be shared by multiple artists than longer names. For example, the artist name “Yes” is more likely to be ambiguous than “Red Hot Chili Peppers.”

In some implementations, the predetermined condition is whether the number of characters in the artist name exceeds a predetermined threshold. In some implementations, the name length feature is binary, and the value of the feature is 1 if the threshold is not exceeded and 0 if the threshold is exceeded. In some implementations, the threshold is 13 characters. In some implementations, the threshold is eight characters. In some implementations, the threshold is any appropriate integer representing a number of characters (e.g., between 0 and 100).

In some implementations, the name length feature is represented as an integer, where the integer represents the number of characters in the artist name associated with the artist identifier. In such cases, a greater integer value may indicate a lower likelihood that the artist identifier is ambiguous.
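Unlike the count-based features, the name length feature takes the value 1 when the threshold is not exceeded. A minimal sketch, assuming the length is measured with Python's built-in len (i.e., counting every character, including spaces), might look like this; the function name is illustrative.

```python
def name_length_features(artist_name, threshold=13):
    """Return (binary_feature, length) for an artist name."""
    length = len(artist_name)
    # Inverted convention: short names (threshold NOT exceeded) map to 1.
    return (1 if length <= threshold else 0), length


short_name = name_length_features("Yes")
long_name = name_length_features("Red Hot Chili Peppers")
```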

Another exemplary feature is whether the number of record labels associated with an artist identifier satisfies a predetermined condition (i.e., a “label count” feature). This feature can indicate artist ambiguity because artists are typically associated with relatively few record labels, and when multiple artists are mistakenly associated with one artist identifier, the number of record labels is higher than expected.

In some implementations, the predetermined condition is whether the number of record labels associated with the artist identifier exceeds a predetermined label threshold. In some implementations, the label count feature is binary, and the value of the feature is 1 if the threshold is exceeded and 0 if the threshold is not exceeded. In some implementations, the threshold is two labels. In some implementations, the threshold is three labels. In some implementations, the threshold is any appropriate integer representing a number of record labels (e.g., between 0 and 100).

In some implementations, the label count feature is represented as an integer, where the integer represents the number of record labels associated with the artist identifier. In such cases, a greater integer value may indicate a higher likelihood that the artist identifier is ambiguous.

Another exemplary feature is whether an artist identifier is associated with albums in at least two different languages (i.e., a “multilingual” feature). In some cases, this feature indicates artist ambiguity because artists tend to release all of their albums in a single language. Thus, if an artist identifier is associated with albums in two or more different languages, it is likely that the artist identifier is mistakenly associated with albums of multiple real-world artists.

In some implementations, languages of albums are determined based on the languages of the track titles associated with the album. For example, natural language processing techniques may be used to determine the language of the track titles.

In some implementations, the language of the album is determined by determining (e.g., using natural language processing) the language of each individual track in an album, and selecting as the language of the album the language associated with the most individual tracks. For example, if an album has eight tracks, and five of them are in English, the language of the album is determined to be English. As another example, if an album has eight tracks, and four are in English, two are in French, and two are in Spanish, the language of the album is determined to be English.
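The per-track selection just described amounts to a plurality vote. The sketch below assumes the language of each track has already been guessed by some language identification step, which is out of scope here; the function name and the language codes are illustrative.

```python
from collections import Counter


def album_language(track_languages):
    """Pick the language associated with the most tracks (plurality vote).

    track_languages is a list of per-track language codes produced by an
    assumed, separate language identification step.
    """
    if not track_languages:
        return None  # no tracks, so no language can be selected
    # most_common(1) returns the single most frequent (language, count) pair.
    return Counter(track_languages).most_common(1)[0][0]


majority = album_language(["en"] * 5 + ["fr"] * 3)
plurality = album_language(["en"] * 4 + ["fr"] * 2 + ["es"] * 2)
```

How ties are broken is not addressed above; in this sketch, Counter returns the tied language that was encountered first.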

In some implementations, the language of the album is determined by combining the track names into a single text string and guessing (e.g., using natural language processing) the language of the text string as a whole. For example, all individual track titles are concatenated into a single text string (e.g., with each track title separated by a delimiter such as a period, comma, semicolon, space, etc.), and the language of this single text string is guessed using natural language processing techniques. The language of the album is then determined to be the language that was guessed for the concatenated text string.

Examples of natural language processing techniques for guessing languages of song titles are discussed in “A comparison of language identification approaches on short, query-style texts,” by Thomas Gottron and Nedim Lipka, published in European Conference on Information Retrieval, 2010, which is hereby incorporated by reference in its entirety.

In some implementations, the multilingual feature is binary, and the value of the feature is 1 if the artist identifier is associated with albums in at least two languages and 0 if the artist identifier is not associated with albums in at least two languages.

In some implementations, the multilingual feature is represented as an integer, where the integer represents the number of album languages associated with the artist identifier. In such cases, a greater integer value may indicate a higher likelihood that the artist identifier is ambiguous.

Another exemplary feature is whether the artist identifier is associated with media items having release dates that satisfy a predetermined condition (i.e., a “time span” feature). This feature can indicate ambiguity because artist identifiers that are associated with media content items released over a longer time period may be more likely to be ambiguous. Specifically, relatively few artists have long careers, so if an artist identifier is associated with media items that are released across multiple decades, for instance, that artist identifier is more likely to be ambiguous than one that is associated with media items spanning a shorter time.

In some implementations, the predetermined condition is whether a difference between an earliest release date and a latest release date of media items associated with an artist identifier exceeds a predetermined threshold. In some implementations, the time span feature is binary, and the value of the feature is 1 if the threshold is exceeded and 0 if the threshold is not exceeded. In some implementations, the threshold is 20 years. In some implementations, the threshold is 10, 15, 25, 30, 35, or 50 years, or any appropriate number of years (e.g., between 0 and 100).

In some implementations, the time span feature is represented as an integer, where the integer represents the number of years between the earliest release date and the latest release date of media items associated with the artist identifier. In such cases, a greater integer value may indicate a higher likelihood that the artist identifier is ambiguous.
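A minimal sketch of the time span feature, assuming release dates are available as integer years (the function name, the input layout, and the sample years are illustrative assumptions):

```python
def time_span_features(release_years, threshold_years=20):
    """Return (binary_feature, span_in_years) for one artist identifier.

    release_years is a hypothetical list of integer release years.
    """
    if not release_years:
        return 0, 0  # nothing released, so there is no span to measure
    span = max(release_years) - min(release_years)
    # Binary form: 1 only when the span exceeds the threshold.
    return (1 if span > threshold_years else 0), span


long_span = time_span_features([1979, 1984, 2011])
short_span = time_span_features([2001, 2005])
```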

The particular parameters and/or thresholds of the features described above may be determined empirically, for example, by analyzing artist metadata from a training set including a set of artists known to be unambiguous and a set of artists known to be ambiguous. By analyzing the training set, threshold values that best indicate artist ambiguity can be determined. For example, it may be determined that none of the ambiguous artists in a particular training set had artist names longer than 13 characters. Thus, as described above, the threshold for a name length feature may be set at 13 characters. Similar calculations are made for each feature that requires a threshold determination in order to determine its value for a given artist identifier.
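For the name length example, one simple way to derive such a threshold is to take the longest name among the artists known to be ambiguous. The sketch below is only one possible reading of "empirically"; the function name and the sample names are invented for illustration.

```python
def name_length_threshold(ambiguous_names):
    """Return the longest name length observed among known-ambiguous artists.

    Assumes ambiguous_names is a non-empty list of artist names taken from
    a labeled training set (hypothetical data in the example below).
    """
    return max(len(name) for name in ambiguous_names)


# Hypothetical training names, chosen only to illustrate the calculation.
threshold = name_length_threshold(["Prince", "Yes", "The Outsiders"])
```

Other features could use analogous calculations, or a search over candidate thresholds scored against both the ambiguous and unambiguous sets.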

Furthermore, binary features are described above as having a value of 1 to indicate likely ambiguity and a value of 0 to indicate likely non-ambiguity, and the conditions are described such that a “true” outcome is indicative of ambiguity. However, different implementations may use different conventions. For example, in some implementations, values of 0 indicate likely ambiguity and values of 1 indicate likely non-ambiguity. Moreover, in some implementations, it is not necessary that each feature be calculated according to the same convention. Accordingly, within a single feature vector, a value of 1 can indicate likely ambiguity for some features, and likely non-ambiguity for other features. (Feature vectors should, however, be consistent across a dataset.)

FIG. 5A illustrates exemplary metadata 502 for an artist identifier, according to some implementations. In some implementations, the metadata 502 is stored in the database 118 of the content provider 116. In some implementations, the metadata is stored in the file container for the underlying media content (e.g., an ID3 container).

The metadata 502 includes an artist identifier 504. In this example, the artist identifier is the name “Prince,” though it need not correspond to the name of the artist. For example, the artist identifier may be any unique identifier (e.g., any alphanumeric string).

The metadata 502 also includes items 506, 508, and 510, corresponding to content items (e.g., music tracks) that are associated with the artist identifier 504 in the database 118. As shown in FIG. 5A, items 506 and 508 correspond to songs by the U.S. pop artist Prince: “Raspberry Beret” and “Little Red Corvette.” Item 510 corresponds to a song by the Caribbean artist named Prince: “Missing You.”

Items 506, 508, and 510 each include metadata entries for artist name, title, country code, and record label. In some implementations, other metadata is included as well, such as album name, genre, track length, year, etc. (not shown).

FIG. 5B illustrates an example feature vector 512 for an artist identifier, according to some implementations. The feature vector 512 includes features x₁, x₂, x₃, x₄, and x₅, each corresponding to one of the features described above. As shown, the feature vector 512 includes the following features: artist match; country count; name length; label count; and multilingual.

FIG. 5B also illustrates the feature vector 512 populated with values derived from the metadata 502 in FIG. 5A, according to some implementations. In some implementations, the populated feature vector 513 is generated by a computer system (e.g., the analytic server 122) as part of a process for determining the probability that artist identifiers in a database (e.g., the database 118) are ambiguous.

The first component of feature vector 512 is an artist match feature. In this example, the artist match feature is binary, where a value of 1 indicates that the artist name associated with an artist identifier matches multiple artist entries in a supplemental metadata database, and a value of 0 indicates that it does not. In this example, the populated feature vector 513 includes a value of 1 for this feature, illustrating a case where the artist name “Prince” is found to be associated with more than one artist in a supplemental database, such as the metadata database 120, FIG. 1.

The next component of feature vector 512 is a country count feature. In this example, the country count feature is binary, where a value of 1 indicates that the number of countries of registration associated with an artist identifier is two or more. Because the metadata 502 indicates that the artist identifier Prince is associated with two different country codes (“US” for the United States and “JM” for Jamaica), the value of this feature is 1.

The next component of feature vector 512 is a name length feature. In this example, the name length feature is binary, where a value of 1 indicates that the length of the “artist name” is less than 13 characters. Because the metadata 502 illustrates that the artist name has only six characters, the value of this feature is 1.

The next component of feature vector 512 is a label count feature. In this example, the label count feature is binary, where a value of 1 indicates that the artist identifier is associated with more than two labels. Because the metadata 502 indicates three different record labels associated with the artist identifier, the value of this feature is 1.

Another component of feature vector 512 is a multilingual feature. In this example, the multilingual feature is binary, where a value of 1 indicates that the artist identifier is associated with albums in two or more languages. Because metadata 502 illustrates that all of the track names are in English (e.g., only 1 language), the value of this feature is 0.
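The five components walked through above can be assembled as in the following sketch. The dictionary layout and function name are assumptions, and the record label names are placeholders: the description states only that three different labels appear in the metadata 502, not what they are. The supplemental match count and album languages are passed in as precomputed inputs.

```python
# Hypothetical in-memory form of metadata 502; label names are placeholders.
metadata = {
    "artist_identifier": "Prince",
    "items": [
        {"title": "Raspberry Beret", "country": "US", "label": "Label A"},
        {"title": "Little Red Corvette", "country": "US", "label": "Label B"},
        {"title": "Missing You", "country": "JM", "label": "Label C"},
    ],
}


def build_feature_vector(metadata, supplemental_match_count, album_languages):
    """Return [artist match, country count, name length, label count,
    multilingual] as binary values, using the thresholds of the example."""
    items = metadata["items"]
    name = metadata["artist_identifier"]
    countries = {item["country"] for item in items}
    labels = {item["label"] for item in items}
    return [
        1 if supplemental_match_count > 1 else 0,   # multiple artist entries
        1 if len(countries) >= 2 else 0,            # two or more countries
        1 if len(name) < 13 else 0,                 # name shorter than 13
        1 if len(labels) > 2 else 0,                # more than two labels
        1 if len(set(album_languages)) >= 2 else 0, # two or more languages
    ]


vector = build_feature_vector(metadata, supplemental_match_count=2,
                              album_languages=["en"])
```

For these inputs the result corresponds to the populated values described for feature vector 513: 1, 1, 1, 1, and 0.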

In some implementations, feature vectors similar to feature vector 512 are created for each artist identifier in the database 118. The analytic server 122 then processes the feature vectors with a statistical classifier (such as a naive Bayes or a logistic regression classifier) to determine the likelihood that each artist identifier is ambiguous.

In some implementations, different combinations of the features described above are used to populate feature vectors for each artist identifier in a dataset. The particular features in a feature vector may depend on the type of classifier that will operate on the feature vector. For example, in some implementations, feature vectors to be processed by a naive Bayes classifier (discussed below) include an artist match feature, a name length feature, and a multilingual feature. In some implementations, the feature vectors for processing by the naive Bayes classifier include multiple artist match features, each corresponding to a respective determination of whether the artist name matches multiple artist entries in a different respective supplemental database.

In some implementations, feature vectors to be processed by a logistic regression classifier (discussed below) include an artist match feature, a country count feature, a name length feature, and a multilingual feature. In some implementations, the feature vectors for processing by the logistic regression classifier include multiple artist match features, each corresponding to a determination of whether an artist name matches multiple artist entries in a different supplemental database.

In some implementations, feature vectors to be processed by the statistical classifiers include other combinations of the features, including any of those described above, or others not described.

Various statistical classifiers may be advantageously used to process feature vectors to calculate probabilities that artist identifiers are ambiguous. Two exemplary statistical classifiers are a naive Bayes classifier and a logistic regression classifier.

In some implementations, a naive Bayes classifier for determining artist ambiguity takes the form

$\begin{matrix}{{P\left( a \middle| x \right)} = \frac{{P(a)}\,{P\left( x \middle| a \right)}}{P(x)}} & {{Equation}\mspace{14mu} (A)}\end{matrix}$

where

-   a is an artist identifier;
-   x is a feature vector;
-   P(a) is the probability that the artist identifier a is associated with media items from two or more different real world artists;
-   P(x) is the probability that an artist identifier has a particular feature vector x;
-   P(x|a) is the probability that the feature vector x is observed for an artist identifier a given that it is known the artist identifier a is associated with media items from two or more different real world artists; and
-   P(a|x) is the probability that the artist identifier a is associated with media items from two or more different real world artists given that the feature vector x is observed.
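As an illustration of how the quantities defined above combine, the sketch below computes P(a|x) for a binary feature vector. It relies on two assumptions that are standard for naive Bayes but are not spelled out above: the features are treated as conditionally independent, so P(x|a) is a product of per-feature probabilities, and P(x) is obtained by summing over the ambiguous and unambiguous cases. The function name and all numeric values are invented for the example.

```python
from math import prod


def naive_bayes_posterior(x, prior, p_one_if_ambiguous, p_one_if_unambiguous):
    """Return P(a|x) for a binary feature vector x.

    prior is P(a); the two lists give, per feature, the probability that
    the feature equals 1 for ambiguous and for unambiguous identifiers.
    """
    def likelihood(p_one):
        # Independence assumption: multiply the per-feature probabilities.
        return prod(p if xi == 1 else 1.0 - p for xi, p in zip(x, p_one))

    joint_ambiguous = prior * likelihood(p_one_if_ambiguous)
    joint_unambiguous = (1.0 - prior) * likelihood(p_one_if_unambiguous)
    evidence = joint_ambiguous + joint_unambiguous  # P(x)
    return joint_ambiguous / evidence


# Illustrative numbers only; real values would be estimated from training data.
posterior = naive_bayes_posterior(
    x=[1, 1, 0],
    prior=0.1,
    p_one_if_ambiguous=[0.8, 0.9, 0.3],
    p_one_if_unambiguous=[0.1, 0.5, 0.05],
)
```

The sketch does not guard against P(x) being zero, which can happen if a per-feature probability is exactly 0 or 1; smoothing the estimates is a common remedy.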

In some implementations, a logistic regression classifier for determining artist ambiguity takes the form

$\begin{matrix}{{P\left( a \middle| x \right)} = \frac{e^{{\sum\limits_{i = 1}^{n}{\beta_{i}x_{i}}} + \beta_{0}}}{1 + e^{{\sum\limits_{i = 1}^{n}{\beta_{i}x_{i}}} + \beta_{0}}}} & {{Equation}\mspace{14mu} (B)}\end{matrix}$

where

-   a is an artist identifier;
-   x is a feature vector of the form (x₁, x₂, . . . , x_(n));
-   β₀, β₁, . . . , β_(n) are constants; and
-   P(a|x) is the probability that the artist identifier a is associated with media items from two or more different real world artists given that the feature vector x is observed.
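Equation B can be evaluated directly, as in the sketch below. The function name and the coefficient values are assumptions for illustration; in practice the constants would be fitted to training data. The two branches are algebraically the same expression, arranged so that the exponential is never taken of a large positive number.

```python
import math


def logistic_probability(x, betas, beta0):
    """Evaluate Equation B: e^z / (1 + e^z), with z = beta0 + sum(beta_i * x_i)."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    if z >= 0:
        # Equivalent form 1 / (1 + e^-z); avoids overflow for large z.
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Illustrative coefficients only.
probability = logistic_probability(
    x=[1, 1, 1, 0], betas=[1.5, 0.5, 1.0, 2.0], beta0=-2.0
)
```

Here z evaluates to 1.0, so the estimated probability is about 0.73.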

In some implementations, an analytic server (e.g., the analytic server 122) processes feature vectors of artist identifiers using one or both of the classifiers to estimate the probability that the artist identifiers are ambiguous. Methods for estimating probabilities using statistical classifiers are described below with respect to FIGS. 6A-6B.

FIGS. 6A-6B are flow diagrams of an exemplary method 600 for estimating artist ambiguity in a dataset, in accordance with some implementations. In some implementations, the method 600 is performed at an electronic device with one or more processors or cores and memory storing one or more programs for execution by the one or more processors. For example, in some implementations, the method 600 is performed at the analytic server 122 of the content provider. While the method is described herein as being performed by the analytic server 122, the method may be performed by other devices in addition to or instead of the analytic server 122, including, for example, the content server 106 or the client device 102. The individual steps of the method may be distributed among the one or more computers, systems, or devices in any appropriate manner.

The analytic server applies a statistical classifier to a first dataset (e.g., the database 118) including a plurality of media items (602). Each media item in the dataset is associated with one of a plurality of artist identifiers, and each artist identifier identifies a real world artist. As noted above, an artist identifier is any identifier (e.g., text, words, numbers, etc.) that uniquely identifies a single real-world artist within the dataset. A real-world artist is an entity (e.g., band, person, group, etc.) that created and/or recorded the particular media item.

In some implementations, media items are music tracks. In some implementations, media items are movies, videos, pictures, podcasts, audio books, television shows, spoken-word recordings, etc.

The statistical classifier applied in step (602) calculates, based on a respective feature vector corresponding to each respective artist identifier, a respective probability that the respective artist identifier is associated with media items from two or more different real-world artists (i.e., the probability that the respective artist identifier is ambiguous).

In some implementations, the probability that a respective artist identifier is ambiguous is represented as a probability estimate having a value within a range of possible values, where the value specifies how likely the artist identifier is to be ambiguous. For example, in some cases, the probability estimate is represented as a value y, where 0≤y≤1. In some implementations, a value of 1 indicates the highest probability that the artist is ambiguous, and 0 represents the lowest probability that the artist is ambiguous. Other scales and/or ranges may be used in various implementations. For example, the probability estimate may be represented as a value between 0 and 100, 1 and 10, −1 and +1, or any other appropriate range.

In some implementations, the probability that a respective artist identifier is ambiguous is represented as a binary result: the result of the classifier (and/or a program applying the classifier) indicates that the artist identifier is likely ambiguous (e.g., corresponding to a value of 1 or “true”), or that it is not likely ambiguous (e.g., corresponding to a value of 0 or “false”). Where the statistical classifier produces probability values within a range, as described above, a binary result is calculated by determining whether the value satisfies a particular threshold value. In some implementations, the threshold is set at 50% of the range of possible values (e.g., a value of 0.5 in implementations where probability estimates range from 0 to 1). In some implementations, other threshold values are used, such as 40%, 75%, 80%, 100%, or any other percentage of the range of probability values.
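Converting a ranged estimate into a binary result can be sketched as follows. The function name is illustrative, and treating "satisfies" as "greater than or equal to" is an assumption; an implementation could equally require the value to strictly exceed the threshold.

```python
def to_binary_result(value, range_min=0.0, range_max=1.0, fraction=0.5):
    """Map a ranged probability estimate to 1 (likely ambiguous) or 0.

    The threshold sits at the given fraction of the way through the range,
    e.g. 0.5 for the range 0..1 or 50 for the range 0..100.
    """
    threshold = range_min + fraction * (range_max - range_min)
    # Assumed interpretation: meeting the threshold counts as satisfying it.
    return 1 if value >= threshold else 0


on_unit_scale = to_binary_result(0.6)
on_percent_scale = to_binary_result(40, 0, 100, fraction=0.75)
on_signed_scale = to_binary_result(0.0, -1.0, 1.0)
```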

Returning to FIG. 6A, in some implementations, the statistical classifier is a naive Bayes classifier (603). In some implementations, the naive Bayes classifier takes the form of Equation A, as described above.

In some implementations, the statistical classifier is a logistic regression classifier (605). In some implementations, the logistic regression classifier takes the form of Equation B, as described above.

In some implementations, the analytic server provides a report of the first dataset, including the calculated probabilities, to a user of the electronic device (606). The report indicates which artist identifiers are likely ambiguous. A human operator may then review the potentially ambiguous artist identifiers to correct any ambiguities (e.g., by creating an additional artist identifier for media items that are mistakenly associated with a particular artist identifier).

Each respective feature vector includes features selected from the group consisting of (608): whether the corresponding respective artist identifier matches multiple artist entries in one or more second datasets (e.g., an “artist match” feature, described above); whether a respective number of countries of registration of media items associated with the corresponding respective artist identifier exceeds a predetermined country threshold (e.g., a “country count” feature, described above); whether a respective number of characters in the corresponding respective artist identifier exceeds a predetermined character threshold (e.g., a “name length” feature, described above); whether a respective number of record labels associated with the corresponding respective artist identifier exceeds a predetermined label threshold (e.g., a “label count” feature, described above); whether the corresponding respective artist identifier is associated with albums in at least two different languages (e.g., a “multilingual” feature, described above); and whether a difference between the earliest release date and the latest release date of media items associated with the corresponding respective artist identifier exceeds a predetermined duration threshold (e.g., a “time span” feature, described above).

In some implementations where the statistical classifier is a naive Bayes classifier, the feature vector includes the following features: an “artist match” feature, a “name length” feature, and a “multilingual” feature.

In some implementations where the statistical classifier is a logistic regression classifier, the feature vector includes the following features: an “artist match” feature, a “country count” feature, a “name length” feature, and a “multilingual” feature.

Turning to FIG. 6B, in some implementations, the analytic server determines whether each respective probability satisfies a predetermined probability condition (610).

In some implementations, the predetermined probability condition is whether the respective probability exceeds a predetermined probability threshold (612). The threshold is any appropriate value, and depends, at least in part, on the range of probability values produced by the statistical classifier. In some implementations, the threshold is 0.5, 0.8, 0.9, or any other appropriate value.

Thereafter, the analytic server sets a flag for each respective artist identifier that satisfies the predetermined probability condition (614). Accordingly, the dataset can be sorted and/or filtered using the flags to identify and/or display artist identifiers that are likely ambiguous. A human operator can then review the potentially ambiguous artist identifiers and take appropriate actions to correct any errors. For example, a human operator may create a new artist identifier, disassociate media items from an incorrect artist identifier, and associate the media items with the new artist identifier.
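Steps (610) through (614) can be sketched as a filter followed by a sort. The function name, the dictionary of probabilities, and the sample values are assumptions for illustration; ordering the flagged identifiers by descending probability is one possible way to present them for review.

```python
def flag_artist_identifiers(probabilities, threshold=0.5):
    """Return flagged (artist_identifier, probability) pairs, most likely first.

    probabilities is a hypothetical dict mapping artist identifiers to the
    classifier's probability estimates on a 0..1 scale.
    """
    # Flag identifiers whose probability exceeds the threshold.
    flagged = [(aid, p) for aid, p in probabilities.items() if p > threshold]
    # Sort so that a reviewer sees the most likely ambiguities first.
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Illustrative probabilities only.
review_queue = flag_artist_identifiers(
    {"Prince": 0.92, "Red Hot Chili Peppers": 0.03, "Yes": 0.71}
)
```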

Some or all of these tasks may be automated so that they do not need to be manually performed by a human operator. For example, a human operator may simply identify those media items that are associated with the wrong artist identifier, and instruct the analytic server 122 (or any other appropriate computer system or device) to perform a disambiguating routine that creates a new artist identifier, associates the identified media items with the new artist identifier, and disassociates the identified media items from the incorrect artist identifier.

In some implementations, the analytic server determines whether the respective probabilities of the respective artist identifiers satisfy a predetermined probability condition. For example, if the probabilities are represented in binary form, the condition may be that the probability is equal to 1 (e.g., indicating that the artist identifier is likely ambiguous). If the probabilities are represented as probability estimates having values within a range of possible values (e.g., 0≤y≤1), the condition may be that the probability estimate meets or exceeds a predetermined probability threshold. In some implementations, the threshold is 0.5, 0.6, 0.9, or any other appropriate value.

In response to detecting that a particular probability of a particular artist identifier satisfies the predetermined probability condition, the analytic server creates a new artist identifier and associates one or more particular media items with the new artist identifier, where the one or more particular media items were previously associated with the particular artist identifier. For example, the analytic server identifies a group of media items that are mistakenly associated with the particular artist identifier, and associates that group of media items with a newly created artist identifier.

In some implementations, prior to creating the new artist identifier, the analytic server identifies the one or more media items that are to be associated with the new artist identifier by consulting a second dataset (e.g., the metadata database 120) to determine which media items should be associated with which artist identifiers.

For example, the analytic server can identify, in the second dataset, all of the media items that are associated with the likely ambiguous artist identifier in the first dataset. The analytic server can then determine which media items should be grouped together under different artist identifiers. Specifically, in some implementations, the analytic server (or a third-party server associated with the second dataset) identifies a first artist entry in the second dataset that is associated with the one or more media items and has a same artist name as the particular artist identifier, and identifies a second artist entry in the second dataset that is not associated with the one or more media items and has the same name as the particular artist identifier.
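One way the grouping described above might be automated is sketched below. Everything about the data layout is an assumption: media items are identified by title, the second dataset is represented as a mapping from same-named artist entries to the titles they contain, the entry keys are invented, and the format of the newly created identifier is arbitrary. Items that cannot be found in the second dataset are left under the original identifier.

```python
def split_artist(artist_id, item_titles, second_dataset_entries):
    """Group an identifier's items by matching same-named second-dataset entries.

    second_dataset_entries is a hypothetical dict mapping an artist entry key
    to the set of titles associated with that entry.
    """
    groups = {}
    unmatched = []
    for title in item_titles:
        # Find the first same-named artist entry that contains this title.
        owner = next(
            (entry for entry, titles in second_dataset_entries.items()
             if title in titles),
            None,
        )
        if owner is None:
            unmatched.append(title)
        else:
            groups.setdefault(owner, []).append(title)

    assignments = {artist_id: list(unmatched)}
    for n, titles in enumerate(groups.values()):
        if n == 0:
            # The first group keeps the original artist identifier.
            assignments[artist_id] = titles + unmatched
        else:
            # Later groups get a new identifier (naming scheme is arbitrary).
            assignments[f"{artist_id} ({n + 1})"] = titles
    return assignments


assignments = split_artist(
    "Prince",
    ["Raspberry Beret", "Little Red Corvette", "Missing You"],
    {"entry-1": {"Raspberry Beret", "Little Red Corvette"},
     "entry-2": {"Missing You"}},
)
```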

In some implementations, the features for the feature vector used in any of the implementations described herein are selected from a different set of features than are described in step (608). For example, instead of being represented as binary values (e.g., whether or not a particular feature satisfies a predetermined threshold), at least a subset of the features are represented as integer values. Specifically, in some implementations, each respective feature vector includes features selected from the group consisting of: whether the corresponding respective artist identifier matches multiple artist entries in one or more second datasets; a respective number of countries of registration of media items associated with the corresponding respective artist identifier (e.g., a “country count” feature represented as an integer value); a respective number of characters in the corresponding respective artist identifier (e.g., a “name length” feature represented as an integer value); a respective number of record labels associated with the corresponding respective artist identifier (e.g., a “label count” feature represented as an integer value); a respective number of languages of albums associated with the corresponding respective artist identifier (e.g., a “multilingual” feature represented as an integer value); and a respective number of years between the earliest release date and the latest release date of media items associated with the corresponding respective artist identifier (e.g., a “time span” feature represented as an integer value).

In some implementations, the features for the feature vector are selected from any combination of features described herein, including those represented as binary values and those represented as integer values.
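
For illustration only, a feature vector that mixes binary and integer features might be scored with a logistic function and compared against a probability threshold, as sketched below. The weights, bias, threshold, and feature names are invented for this sketch. In practice such parameters would be learned from a training dataset, and nothing here represents the parameters of any actual implementation.

```python
import math

# Hypothetical parameters, chosen by hand for illustration only.
WEIGHTS = {
    "multiple_matches": 2.0,    # binary feature
    "multilingual_flag": 1.0,   # binary feature
    "country_count": 0.25,      # integer feature
    "label_count": 0.25,        # integer feature
}
BIAS = -4.0
PROBABILITY_THRESHOLD = 0.5


def ambiguity_probability(features):
    """Logistic score: estimated probability that the identifier is ambiguous."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def is_likely_ambiguous(features, threshold=PROBABILITY_THRESHOLD):
    """Apply the probability condition (here: exceeding a threshold)."""
    return ambiguity_probability(features) > threshold


# A mixed vector: two binary indicators alongside two integer counts.
mixed = {
    "multiple_matches": 1,
    "multilingual_flag": 1,
    "country_count": 4,
    "label_count": 6,
}
probability = ambiguity_probability(mixed)
flagged = is_likely_ambiguous(mixed)
```

Because a logistic model takes a weighted sum of its inputs, binary and integer features can share one vector without special handling, although integer features with large ranges may warrant scaling.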

The foregoing description, for purpose of explanation, has been described with reference to specific implementations. However, the illustrative discussions above are not intended to be exhaustive or to limit the disclosed ideas to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The implementations were chosen and described in order to best explain the principles and practical applications of the disclosed ideas, to thereby enable others skilled in the art to best utilize them in various implementations with various modifications as are suited to the particular use contemplated.

Moreover, in the preceding description, numerous specific details are set forth to provide a thorough understanding of the presented ideas. However, it will be apparent to one of ordinary skill in the art that these ideas may be practiced without these particular details. In other instances, methods, procedures, components, and networks that are well known to those of ordinary skill in the art are not described in detail to avoid obscuring aspects of the ideas presented herein.

It will also be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first server could be termed a second server, and, similarly, a second server could be termed a first server, without changing the meaning of the description, so long as all occurrences of the “first server” are renamed consistently and all occurrences of the “second server” are renamed consistently.

Further, the terminology used herein is for the purpose of describing particular implementations only and is not intended to be limiting of the claims. As used in the description of the implementations and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

Finally, as used herein, the term “if” may be construed to mean “when” or “upon” or “in response to determining” or “in accordance with a determination” or “in response to detecting,” that a stated condition precedent is true, depending on the context. Similarly, the phrase “if it is determined (that a stated condition precedent is true)” or “if (a stated condition precedent is true)” or “when (a stated condition precedent is true)” may be construed to mean “upon determining” or “in response to determining” or “in accordance with a determination” or “upon detecting” or “in response to detecting” that the stated condition precedent is true, depending on the context.

What is claimed is:
 1. A method, comprising: at a server system having one or more processors and memory storing one or more programs for execution by the one or more processors: generating a feature vector that represents a first artist identifier of a plurality of artist identifiers in a first dataset, the feature vector including a first indication of whether the first artist identifier matches multiple artist entries in one or more second datasets that are distinct from the first dataset; determining, based at least in part on the first indication, a probability that the first artist identifier is associated with two or more different real-world artists; determining whether the probability satisfies a predetermined probability condition; and creating, in response to determining that the probability satisfies the predetermined probability condition, a new artist identifier.
 2. The method of claim 1, further including: prior to creating the new artist identifier, consulting the one or more second datasets to identify one or more particular media items that are to be associated with the new artist identifier.
 3. The method of claim 2, wherein the one or more particular media items were previously associated with the first artist identifier.
 4. The method of claim 2, wherein identifying the one or more particular media items includes identifying all media items in the one or more second datasets that are associated with the first artist identifier in the first dataset.
 5. The method of claim 4, wherein identifying all the media items in the one or more second datasets that are associated with the first artist identifier includes: identifying a first artist entry in the one or more second datasets that is associated with one or more media items and has a same artist name as the first artist identifier; and identifying a second artist entry in the one or more second datasets that is not associated with the one or more media items and has the same artist name as the first artist identifier.
 6. The method of claim 2, further including: associating the one or more particular media items with the new artist identifier.
 7. The method of claim 1, wherein the new artist identifier is associated with a real-world artist in the first dataset.
 8. The method of claim 1, wherein determining whether the probability satisfies the predetermined probability condition includes determining whether the probability exceeds a predetermined probability threshold.
 9. The method of claim 1, wherein the first dataset is associated with a media content provider and the one or more second datasets are supplemental databases associated with one or more metadata providers distinct from the media content provider.
 10. The method of claim 1, wherein the first artist identifier is associated with metadata in the first dataset including a first artist name and one or more media items associated with the first artist name.
 11. The method of claim 10, wherein the metadata associated with the first artist identifier in the first dataset further includes a title, a country code, a record label, genre, track length, and/or year.
 12. The method of claim 1, further comprising: receiving a user input identifying the first artist identifier as ambiguous; and creating the new artist identifier in response to receiving the user input identifying the first artist identifier as ambiguous.
 13. The method of claim 1, wherein: the feature vector includes a second indication selected from the group consisting of: (i) an indicator of whether a number of countries of registration of media items associated with the first artist identifier satisfies a country threshold; (ii) an indicator of whether a number of characters in the first artist identifier satisfies a character threshold; (iii) an indicator of whether a number of record labels associated with the first artist identifier satisfies a label threshold; (iv) an indicator of whether the first artist identifier is associated with albums in at least two different languages; and (v) an indicator of whether a difference between an earliest release date and a latest release date of media items associated with the first artist identifier satisfies a time-span threshold; and the probability that the first artist identifier is associated with two or more different real-world artists is determined based at least in part on the first indication and the second indication.
 14. The method of claim 13, wherein: the second indication is the indicator of whether the number of countries of registration of media items associated with the first artist identifier satisfies the country threshold; and the method further comprises, at the server system, determining the country threshold based on artist metadata from a training dataset that includes a first set of artist identifiers that are known to be unambiguous and a second set of artist identifiers that are known to be ambiguous.
 15. The method of claim 13, wherein: the second indication is the indicator of whether the number of characters in the first artist identifier exceeds the character threshold; and the method further comprises, at the server system, determining the character threshold based on artist metadata from a training dataset that includes a first set of artist identifiers that are known to be unambiguous and a second set of artist identifiers that are known to be ambiguous.
 16. The method of claim 13, wherein: the second indication is the indicator of whether the number of record labels associated with the first artist identifier exceeds the label threshold; and the method further comprises, at the server system, determining the label threshold based on artist metadata from a training dataset that includes a first set of artist identifiers that are known to be unambiguous and a second set of artist identifiers that are known to be ambiguous.
 17. The method of claim 1, wherein determining the probability includes applying a statistical classifier to the feature vector to calculate the probability.
 18. The method of claim 17, wherein the statistical classifier is a naïve Bayes classifier or a logistic regression classifier.
 19. A server system comprising: one or more processors; and memory storing one or more programs for execution by the one or more processors, the one or more programs including instructions for: generating a feature vector that represents a first artist identifier of a plurality of artist identifiers in a first dataset, the feature vector including a first indication of whether the first artist identifier matches multiple artist entries in one or more second datasets that are distinct from the first dataset; determining, based at least in part on the first indication, a probability that the first artist identifier is associated with two or more different real-world artists; determining whether the probability satisfies a predetermined probability condition; and creating, in response to determining that the probability satisfies the predetermined probability condition, a new artist identifier.
 20. A non-transitory computer readable storage medium storing one or more programs, the one or more programs comprising instructions which, when executed by a server system with one or more processors, cause the server system to: generate a feature vector that represents a first artist identifier of a plurality of artist identifiers in a first dataset, the feature vector including a first indication of whether the first artist identifier matches multiple artist entries in one or more second datasets that are distinct from the first dataset; determine, based at least in part on the first indication, a probability that the first artist identifier is associated with two or more different real-world artists; determine whether the probability satisfies a predetermined probability condition; and create, in response to determining that the probability satisfies the predetermined probability condition, a new artist identifier.